Internet Info 1997 December

Internet Message Format | 1997-02-19 | 6KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
    id PAA18135 for urn-ietf-out; Thu, 24 Oct 1996 15:38:28 -0400
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
    by services.bunyip.com (8.6.10/8.6.9) with SMTP id PAA18126
    for <urn-ietf@services.bunyip.com>; Thu, 24 Oct 1996 15:38:22 -0400
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP
    (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA29063
    (mail destined for urn-ietf@services.bunyip.com);
    Thu, 24 Oct 96 15:38:17 -0400
Received: from ifi.unizh.ch by josef.ifi.unizh.ch
    id <01446-0@josef.ifi.unizh.ch>; Thu, 24 Oct 1996 21:38:20 +0100
Subject: Re: [URN] UNICODE or not UNICODE?
To: paf@swip.net
Date: Thu, 24 Oct 1996 21:38:19 +0100 (MET)
Cc: urn-ietf@bunyip.com, splinter@bunyip.com
In-Reply-To: <v03007802ae94cf233d55@[192.71.220.137]>
    from "Patrik Faltstrom" at Oct 24, 96 11:08:19 am
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 4139
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..471:24.09.96.20.38.21"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

>
>Terry wrote
>> - why care? the NSS is supposed to be opaque
>> - does this imply that a) NSSs should be formed originally
>>   in Unicode, or that b) NSSs in other coded character sets
>>   must be translated/transliterated into Unicode in forming
>>   URNs, or c) something else?
>
>The problem with not defining a character set is that it will
>be impossible to do any comparison between two URNs. We need to have
>some ability to do comparisons, and the only reasonable way of doing that
>is to use _one_ character set.
>
>To answer the second question, you have to have a urn in a different
>character set in some cases, for example in a client which does not
>use UNICODE. You then have to do translation.
>
>What we did in Digger, the Whois++ server Bunyip has, was using the
>rules for comparison (decomposition + sorting) and translation rules
>that the UNICODE consortium have defined.
>
>The UNICODE tables include for each character one kind of equivalence
>which is the rule for decomposition of that code point in UNICODE
>into more than one other code point. One example, the letter 'Ä'.
>
>In UNICODE, this character is defined as:
>
>00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;
> LATIN CAPITAL LETTER A DIAERESIS;;;00E4;
>
>One can here see that the codepoint, U+00C4, is equivalent to U+0041
>followed by U+0308. One can also see that the lower case version of this
>character is U+00E4 (among other things).
>
>When comparing two strings, and one of them includes "U+00C4"
>and the other one the sequence "U+0041"+"U+0308", these strings
>should be considered equal in the sense of the UNICODE spec.
>
>Note that I am not talking about ISO-10646 here, as I am not at all
>familiar with what parts of this are included in the 10646 spec. This
>is UNICODE 2.0 we are talking about!

ISO 10646 does not define any such equivalences, and in other cases
too it defines far fewer character properties and semantics.

>This means that before comparing the strings that include "U+00C4"
>and "U+0041"+"U+0308", all code points have to be decomposed into their
>maximal decomposition possible. I.e. "U+00C4" has to be changed to
>"U+0041"+"U+0308", and these have to be sorted (i.e. one can know
>from the code tables that "U+0041" is to be before "U+0308" in
>a composed character).

Sorting really only occurs on diacritics; the base character (A or
U+0041 here) is not involved. Sorting is necessary because you can
have A with X on top and Y below, which you can encode as AXY or AYX.
Sorting is restricted because you can have X and Z which both go on
top, and in this case AXZ means Z atop X (atop A), and AZX means
X atop Z, and these two are not equivalent.

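The decompose-then-sort-then-compare procedure discussed above can be sketched as follows. This is only an illustration, not the Digger code (which is not shown in this thread). It assumes Python's standard unicodedata module, whose tables follow a much later Unicode version than 2.0, and it relies on what current Unicode calls Normalization Form D (NFD), which performs full canonical decomposition followed by the restricted sorting of diacritics. The function names and the particular diacritics chosen for X, Y and Z are my own picks for the example.

```python
import unicodedata


def canonical_key(s: str) -> str:
    """Decompose every code point fully and reorder diacritics (NFD)."""
    return unicodedata.normalize("NFD", s)


def canonically_equal(a: str, b: str) -> bool:
    """Compare code point by code point after decomposition and sorting."""
    return canonical_key(a) == canonical_key(b)


# The decomposition field from the character table: U+00C4 -> U+0041 U+0308.
print(unicodedata.decomposition("\u00C4"))            # 0041 0308

# Precomposed U+00C4 versus the sequence U+0041 U+0308.
print(canonically_equal("\u00C4", "\u0041\u0308"))    # True

# X on top (U+0308 diaeresis, combining class 230) and
# Y below (U+0323 dot below, combining class 220):
# AXY and AYX are sorted into the same order, so they compare equal.
print(unicodedata.combining("\u0308"), unicodedata.combining("\u0323"))
print(canonically_equal("A\u0308\u0323", "A\u0323\u0308"))   # True

# X and Z both on top (U+0308 diaeresis and U+0304 macron, both class 230):
# no reordering happens, so AXZ and AZX stay different.
print(canonically_equal("A\u0308\u0304", "A\u0304\u0308"))   # False
```

The sort key is the combining class of each diacritic, which is why the two marks of the same class are left in the order written.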
>THEN we do the comparison codepoint by codepoint.
>
>If we wanted to do case insensitive matching, we use the information
>from the UNICODE consortium about what is a lower case character.

Well, with case we get into problems, as the correspondences supplied
by Unicode work for almost all languages, but have some exceptions
(e.g. Turkish). But case equivalence is not needed in the general
URN syntax.

>>or use Unicode code points that
>>indicate glyph variants of a letter, such as 06AA, "Arabic
>>Letter Swash Kaf," which is lexically the same as 0643,
>>"Arabic Letter Kaf" or specify some ligatures).
>
>I don't know Arabic, but I am just following the rules that
>the UNICODE consortium have set up, and according to these
>rules, U+06AA and U+0643 are not equivalent characters
>when comparing:
>
>06AA;ARABIC LETTER SWASH KAF;Lo;0;R;;;;;N;ARABIC LETTER SWASH CAF;;;;
>0643;ARABIC LETTER KAF;Lo;0;R;;;;;N;ARABIC LETTER CAF;;;;
>
>I am not arguing if this decision by the UNICODE consortium
>was correct or not, but _someone_ that they trusted must have
>told them these are different characters.

I don't know the use of swash kaf either, but I would assume that
it is used as a special character in some languages, in some specific
cases or to denote a different sound. This would mean that people
who really want to use swash kaf in their URNs would not have a
problem distinguishing it from the normal kaf.

Regards, Martin.
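
The two points above (case mapping with its Turkish exception, and the two kaf characters being distinct) can be checked with a short sketch. It assumes Python's standard unicodedata module and built-in string methods, which use the language-independent default mappings of a Unicode version much later than 2.0; caseless_equal is a hypothetical helper written for this example, not something proposed for the URN syntax.

```python
import unicodedata


def nfd(s: str) -> str:
    """Full canonical decomposition with diacritics in canonical order."""
    return unicodedata.normalize("NFD", s)


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: default (language-independent) case folding
    applied on top of canonical decomposition."""
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())


# The lower case mapping listed in the table entry: U+00C4 -> U+00E4.
print("\u00C4".lower() == "\u00E4")          # True
print(caseless_equal("\u00C4", "a\u0308"))   # True

# The Turkish exception: the default mapping pairs "I" with "i",
# while Turkish pairs "I" with dotless U+0131.
print(caseless_equal("I", "i"))              # True
print(caseless_equal("I", "\u0131"))         # False
print("\u0131".upper() == "I")               # True

# Swash kaf and kaf: neither has a decomposition, so canonical
# comparison keeps them distinct.
print(repr(unicodedata.decomposition("\u06AA")))   # ''
print(repr(unicodedata.decomposition("\u0643")))   # ''
print(nfd("\u06AA") == nfd("\u0643"))              # False
```

A comparison that had to be correct for Turkish text would need a language-specific mapping supplied from outside these default tables.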